ABSTRACT

In this paper, we propose an approach for Relationship Ex- 
traction (RE) based on labeled graph kernels. The kernel 
we propose is a particularization of a random walk kernel 
that exploits two properties previously studied in the RE 
literature: (i) the words between the candidate entities or 
connecting them in a syntactic representation are particu- 
larly likely to carry information regarding the relationship; 
and (ii) combining information from distinct sources in a 
kernel may help the RE system make better decisions. We 
performed experiments on a dataset of protein-protein in- 
teractions and the results show that our approach obtains 
effectiveness values that are comparable with state-of-the-art
kernel methods. Moreover, our approach is able to
outperform the state-of-the-art kernels when combined with 
other kernel methods. 

Categories and Subject Descriptors 

H. 2 [Information Storage and Retrieval]: Content Anal- 
ysis and Indexing - Linguistic Processing 

Algorithms

1. INTRODUCTION

With the increasing use of Information Technologies, the 
amount of unstructured text available in digital data sources 
(e.g., email communications, blogs, reports) has grown at an 
impressive rate. These texts may contain knowledge that is vital
to human decision-making processes. However, it is infeasible
for a human to analyze large amounts of unstructured information
in a short time. A typical approach to this problem is to
transform the unstructured information in digital sources into a
previously defined structured format.



Information Extraction (IE) is the scientific area that studies 
techniques to extract semantically relevant segments from 
unstructured text and represent them in a structured format 
that can be understood/used by humans or programs (e.g., 
decision support systems, interfaces for digital libraries). In 
the past few years, there has been an increasing interest 
in IE, from industry and scientific communities. In fact, 
this interest led to huge advances in this area and several 
solutions were proposed in applications such as Semantic 
Web [3] and Bioinformatics [14] [2].

Regardless of the application domain, an IE activity can be
modeled as a composition of the following high-level tasks:



• Segmentation: divides the text into atomic segments 
(e.g., words). 

• Entity recognition: assigns a class (e.g., organiza- 
tion, person) to each segment of the text. Each pair 
(segment, class) is called an entity. 

• Relationship extraction: determines relationships 
(e.g., born_in, works_for) between entities. 

• Entity normalization: converts entities into a stan- 
dard format (e.g., convert all dates to a pre-defined 
format). 

• Co-reference resolution: determines which entities 
represent the same object/individual in the real world 
(e.g., IBM is the same as "Big Blue"). 



In the last decade, several techniques to increase the accu- 
racy of these tasks were proposed. In this paper, we focus 
only on the Relationship Extraction (RE) task. The ap- 
proaches that are typically used for RE can be divided into 
two major groups: (i) handcrafted solutions, in which the 
programs are manually specified by the user through a set 
of rules; and (ii) Machine Learning solutions, in which the 
programs are automatically generated by a machine either 
by explicitly producing rules or by generating a statistical 
model that is able to produce extraction results with regard 
to a set of characteristics of the input text. 

Most of the first approaches to RE were based on handcrafted
rules [3] [15]. Typically, they exploited common patterns and
heuristics to extract the desired relationships from the results of
complex Natural Language Processing chains. These solutions
produced good results in several specific domains. However, they
require considerable human effort to produce rules for each new
domain.

To overcome this problem of handcrafted solutions, the ap- 
plication of Machine Learning to RE started to receive a lot 
of attention. Typically, machine learning techniques used 
for RE are supervised. However, some works have exploited 
semi-supervised [6] [1] [11] [13] and unsupervised [10] [12] techniques.
Supervised approaches to RE are typically based on
classifiers that are responsible for determining whether there 
is a relationship or not between a set of entities. 

There are two major lines of works in supervised approaches 
to RE: (i) feature-based methods, which try to find a good 
set of features to use in the classification process; and (ii) 
kernel methods, which try to avoid the explicit computation 
of features by developing methods that are able to com- 
pare structured data (e.g., sequences, graphs, trees). Even 
though feature-based methods for RE work well [16] , there 
has been an increasing interest in exploiting kernel-based 
methods, due to the fact that sentences are better described 
as structures (e.g., sequences of words, parsing trees, depen- 
dency graphs). 

In this paper, we describe a new supervised approach to RE
that is based on labeled dependency graph representations
of the sentences. The advantage of representing a sentence as a
labeled dependency graph is that the graph carries rich semantic
information, which typically provides useful hints for deciding
whether a set of entities in a sentence are related. The solution
we propose uses kernels to deal with these structures. We propose
the application of a marginalized kernel to compare labeled
graphs [17]. This kernel is based on random walks on graphs and
is able to exploit an infinite-dimensional feature space by reducing
its computation to the problem of solving a system of linear
equations. In order to make this graph kernel suitable for RE, we
modified it to exploit the following properties, which were
introduced in previous proposals of kernels for RE: (i) the
words between the candidate entities or connecting them in
a syntactic representation are particularly likely to carry
information regarding the relationship [7]; and (ii) combining
information from distinct sources in a kernel may help the
RE system make better decisions [13].

In order to evaluate the model we propose, we performed
experiments with a biomedical dataset called AIMed
[8]. This dataset is composed of several abstracts from Biology
papers. The documents are annotated with interaction
relationships between proteins. The results show that the
performance of our approach is comparable to the state of
the art. Moreover, when combining our kernel with other
kernel methods, we were able to outperform other
state-of-the-art kernel methods.

The rest of the paper is organized as follows. In Section [2] 
we present the related work. Section [3] defines the problem 
that we are trying to solve. In Section [4] we describe our 
method for relationship extraction. In Section [5] we report 
on the experiments performed. Finally, Section [6] presents 
the conclusions and some topics for future work. 



2. RELATED WORK 

The works most relevant to this paper are the
ones that propose kernel methods for RE. In the past ten
years, several authors proposed kernels for different syntactic
and semantic structures of a sentence. One of the first approaches,
presented in 2003 by Zelenko et al. [20], is a kernel
based on a shallow parse tree representation of sentences. This
approach was vulnerable to parsing errors. In order to overcome
this problem, Culotta and Sorensen [9] proposed a generalization
of this kernel that, when combined with a bag-of-words kernel, is
able to compensate for the parsing errors.

In 2005, Bunescu and Mooney [7] proposed a kernel based on
the shortest path between entities in a dependency graph.
The kernel was based on the hypothesis that the words between
the candidate entities or connecting them in a syntactic
representation are particularly likely to carry information
regarding the relationship. The problem with this kernel is that
it is not very flexible when comparing candidates,
which leads to very low recall when the training
data is too small. The same authors proposed a different
kernel based on subsequences [8]. The subsequences used
in this approach can be combinations of words and other
tags (e.g., POS tags, WordNet synsets). This kernel obtains
good results and is still pointed out as one of the
best-performing kernels in RE tasks.

Giuliano et al. [14] proposed in 2006 a kernel based only
on shallow linguistic information of the sentences. The idea
was to exploit two simple kernels that, when combined, were
able to obtain very interesting results. The global context
kernel compares the whole sentence using a bag-of-n-grams
approach. The frequencies of the n-grams are computed
in three different locations of the sentence: (i) before the
first entity; (ii) between the two entities; and (iii) after
the second entity. The local context kernel evaluates the
similarity between the entities of the sentences as well as
the words in a window of limited size around them. The
advantage of this kernel is its simplicity, since it does not
need deep Natural Language Processing tools to preprocess
the sentences in order to compute the kernel. However, this
simplicity may also be a significant disadvantage, since the kernel
is not able to exploit rich syntactic/semantic information
such as a parse tree or a dependency graph representation
of a sentence (structures that can be useful for
determining whether a set of entities are related).

In 2008, Airola et al. [3] presented a kernel that combines 
two graph representations of a sentence: (i) a labeled de- 
pendency graph; and (ii) a linear order representation of 
the sentence. The kernel considers all possible paths con- 
necting any two vertices in the graph. The results obtained 
are comparable with the state-of-the-art results. However, 
this kernel is very demanding in terms of computational re- 
sources. 



In 2010, Tikk et al. [19] performed a study to analyze how a
very comprehensive set of kernels for relationship extraction
performs when dealing with the task of extracting protein-protein
interactions. Even though they were not able to determine
a clear winner in their comparison, they were still able to
outline some very interesting conclusions. First, they notice
that kernels based on dependency parsing tend to obtain
better results than kernels based on tree parses. Moreover,
they show that a simple kernel, like [14], can still obtain
results that are at the level of the best kernels based on
dependency parsing.

Figure 1: A sentence from a biomedical text containing
three references to proteins (TRADD, RIP
and Fas) and two interaction relationships between
them (TRADD interacts with RIP and RIP interacts
with Fas).

Figure 2: Candidates generated from the sentence
of Figure [1].



3. PROBLEM DEFINITION 

In general, the problem of finding an n-ary relationship be- 
tween entities can be seen as a classification problem for 
which the input is a set of n entities and the output is the 
type of relationship between them or an indication that they 
are not related at all. 

With this definition, given a text document with all the 
entities identified, the candidate results are all the sets of n 
entities that exist in the text. This approach would generate 
a huge set of candidates among which very few correspond 
to actually related entities. For this reason, this configuration
would potentially lead to performance issues (due
to the huge number of candidates) and to problems in
terms of accuracy (due to the imbalance of the data).
To avoid these issues, we exploit a heuristic that is typically
used in related works, which consists of limiting the
candidates to sets of entities that can be found in the same
sentence.

This way, for one sentence with k entities, the number of
candidates generated for an n-ary relationship is given by the
number of combinations of the k entities, selected n at a
time, i.e., $\binom{k}{n}$. For instance, consider the sentence in
Figure [1], an example of a sentence from
a biomedical text. Suppose that we aim at finding interaction
relationships between proteins. This sentence contains
three identified proteins: TRADD, RIP and Fas. Moreover,
there are two interaction relationships between these entities:
TRADD interacts with RIP and RIP interacts with
Fas.

Given the fact that a protein interaction is a binary relationship,
we have a total of $\binom{3}{2} = 3$ candidates, which are
presented in Figure [2].

Note that it is also possible to use other heuristics to reduce
the number of candidates. For instance, in some cases, we
may have knowledge about the types of entities that can
fulfill a given role in a relationship (e.g., in a relationship
between a company and its CEO, it is known that one of
the entities must be a company and the other a person).
Even though these heuristics typically involve some type of
prior knowledge about the application domain, they tend to
drastically reduce the space of candidates. This makes
the relationship extraction process a lot easier and helps it
produce better results, since candidates involving
entities that are never related are discarded.
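
The candidate generation described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the paper: the function names, the `(name, type)` tuples used to represent entities, and the `allowed_types` filter are all hypothetical.

```python
from itertools import combinations


def generate_candidates(entities, n=2):
    """Return every set of n entities from one sentence (k choose n candidates)."""
    return list(combinations(entities, n))


def filter_by_types(candidates, allowed_types):
    """Keep only candidates whose sorted tuple of entity types is allowed."""
    return [c for c in candidates
            if tuple(sorted(t for _, t in c)) in allowed_types]


# Entities of the example sentence, all tagged with an assumed type "protein".
entities = [("TRADD", "protein"), ("RIP", "protein"), ("Fas", "protein")]
candidates = generate_candidates(entities, n=2)
for candidate in candidates:
    print([name for name, _ in candidate])
```

For the three proteins above, the sketch enumerates the three binary candidates. A type filter such as `filter_by_types(candidates, {("company", "person")})` plays the role of the prior-knowledge heuristic and discards pairs whose types can never be related.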

Assuming a set of candidate results, Figure [3] describes how
the RE task can be represented as a classification
problem. The problem can be divided into two main
phases: training and execution. In the training phase, the
objective is to automatically generate a statistical model that
is able to determine whether a given candidate corresponds
to a relationship. In order to produce this model, some
training examples must be provided to a learning algorithm
(e.g., solving a quadratic optimization problem in the case
of an SVM classifier). These examples are generated in the
same fashion as the candidates; however, they include an
additional label that indicates whether they correspond to
a relationship.

The execution phase aims at classifying each unlabeled can- 
didate from new untagged documents as containing a rela- 
tionship or not. This decision is made using the statistical 
model created in the training phase and a classification algorithm.
At the end of the process, the sets of entities in the
candidates that are classified as containing a relationship are
returned.

4. METHOD 

In this Section, we present the proposed kernel method. We
start by describing the basic idea behind kernel methods for
RE in Section [4.1]. Then, in Section [4.2], we propose a representation
of the candidate sentences as labeled graphs. In
Section [4.3], we explain the random walk kernel that was used
as the basis for our RE kernel. In Section [4.4], we present
the parameters used to modify the random walk kernel for
our problem. Finally, in Section [4.5], we propose our kernel
for RE.

4.1 Kernel Methods for Relationship Extraction

In some cases, input objects of a classifier may not be easily
expressed via feature vectors (e.g., if the range of possible
features is too wide or if the nature of the object does not
make it clear how to choose the features). Therefore, the
feature engineering process may become painfully hard and
lead to high-dimensional feature spaces and consequently to
computational problems. Kernel methods are an alternative
to feature-based methods that can be used to classify objects
while keeping their original representation.

Figure 3: Representation of an RE task as a classification
problem.

In kernel methods, the idea is to exploit a similarity function
(kernel) between input objects. This function, with the
help of a discriminative machine learning technique, is used
to classify new examples. In order for a similarity function
to be an acceptable kernel function, K(x, y), it must
respect the following properties: (i) it must be a two-argument
function from the object space X to a number in
$[0, +\infty[$ ($K : X \times X \to [0, +\infty[$); (ii) it must be symmetric
($\forall x, y \in X,\ K(x, y) = K(y, x)$); and (iii) it must be
positive-semidefinite ($\forall x_1, x_2, \ldots, x_n \in X$, the $n \times n$
matrix $(K(x_i, x_j))_{ij}$ is positive-semidefinite).

RE is an example of a problem for which the inputs may
not be easily expressed via feature vectors. As described in
Section [3], the inputs of the learning and classification
algorithms in supervised RE tasks are sentences. Typically,
sentences are better described as structures (e.g., sequences
of words, parsing trees, dependency graphs) and it is
interesting to use these representations directly.

4.2 Labeled Graph Representation of the Sentences

In our approach, we assume that the inputs of the learning
and classification algorithms are labeled graph representations
of the candidate sentences (see Figure [4]). In this graph,
each vertex is associated with a word in the sentence and is
enriched with additional features of the word. In our representation,
the additional features include POS tags, generic
POS tags, the lemma of the word and capitalization patterns
(however, for simplicity, we represent only one additional
feature in the graph of Figure [4], which is the POS tag). We
could use other potentially useful features like hypernyms or
synsets extracted from WordNet. The edges represent
semantic relationships between the words. The type of the
semantic relationship is represented by the edge label.

Recall that, for a given sentence with k entities, when searching
for an n-ary relationship, the number of candidates that
are generated is $\binom{k}{n}$. In terms of structure (vertexes and
edges), the corresponding dependency graph for each of these
candidates is always the same. If we used only structural
information to compare candidates, we would not be able to
distinguish between different candidates generated from the
same sentence that are expected to produce different
classification results.

For this reason, we used heuristics to enrich our graph
representation. First, the entities that are candidates to be related
can provide very important clues for detecting if there is a
relationship [14]. We define a predicate isEntity(v), which
receives a vertex of the graph and determines whether it is
an entity. With this, it is possible for a kernel to use this
information in the computation of the similarity between
graphs. Second, the shortest path hypothesis, formalized in
[7], states that the words between the candidate entities or
connecting them in a syntactic representation are particularly
likely to carry information regarding their relationship.
Analogously to [7] and [2], we exploited this hypothesis by
defining a predicate called inSP(x) that receives as input a
node or an edge of the graph and returns true if it belongs
to the shortest path between the two entities of the graph.
As in the case of the entities, this allows the kernel to treat
these vertexes and edges in a special way.
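
The two predicates can be illustrated with a small sketch. It is a minimal example under several assumptions that do not come from the paper: the class and function names are hypothetical, the toy sentence and its dependency labels are invented for illustration, and the shortest path is computed with a breadth-first search that ignores edge direction.

```python
from collections import deque


def shortest_path(n_vertices, edges, source, target):
    """Breadth-first search that ignores edge direction; returns a vertex list."""
    neighbours = {v: set() for v in range(n_vertices)}
    for (i, j) in edges:
        neighbours[i].add(j)
        neighbours[j].add(i)
    parent = {source: None}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        if v == target:
            break
        for w in sorted(neighbours[v]):
            if w not in parent:
                parent[w] = v
                queue.append(w)
    if target not in parent:
        return []
    path, v = [], target
    while v is not None:
        path.append(v)
        v = parent[v]
    return path[::-1]


class CandidateGraph:
    """Labeled graph of one candidate: vertex features, typed edges, two entities."""

    def __init__(self, labels, edges, entities):
        self.labels = labels          # one feature tuple per vertex, e.g. (word, POS)
        self.edges = edges            # {(head, dependent): dependency type}
        self.entities = set(entities)
        path = shortest_path(len(labels), edges, *entities)
        steps = set(zip(path, path[1:]))
        self.sp_vertices = set(path)
        self.sp_edges = {e for e in edges if e in steps or e[::-1] in steps}

    def is_entity(self, v):
        return v in self.entities

    def in_sp(self, x):
        """x is either a vertex index or an edge given as a (head, dependent) tuple."""
        return x in self.sp_edges if isinstance(x, tuple) else x in self.sp_vertices


# Hypothetical sentence: "TRADD strongly interacts with RIP".
labels = [("TRADD", "NN"), ("strongly", "RB"), ("interacts", "VBZ"),
          ("with", "IN"), ("RIP", "NN")]
edges = {(2, 0): "nsubj", (2, 1): "advmod", (2, 3): "prep", (3, 4): "pobj"}
graph = CandidateGraph(labels, edges, entities=(0, 4))
print(sorted(graph.sp_vertices))
```

In this toy graph the adverb and its `advmod` edge lie outside the shortest path, so `in_sp` returns false for them while the two entities and the words connecting them are flagged.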

4.3 Random Walks Kernel 

The random walk kernel used as a basis of our RE kernel 
was defined in [17] as a marginalized kernel between labeled 
graphs. The basic idea behind this kernel is the following 
one: given a pair of graphs, perform simultaneous random 
walks between the vertexes of the graphs and count the num- 
ber of matching paths. In a more formal way, the objective 
of the kernel is to compute the expected number of matching 
paths between the two graphs. 

In order to explain this kernel, we start by defining the graph
that is expected as input. Let G be a labeled directed graph
and $|G|$ be the number of vertexes in the graph. All vertexes
in the graph are labeled and $v_i$ denotes the label of
vertex i. The edges of the graph are also labeled and $e_{ij}$
denotes the label of the edge that connects vertex i and vertex
j. Moreover, we assume two kernel functions, $K_v(v, v')$
and $K_e(e, e')$, that are kernel functions between vertexes and
edges, respectively. Figure [5] presents an example of a graph
that can be used as input of the random walk kernel.

In addition to the graph, this kernel also assumes the existence
of three probability distributions: (i) the initial probability
distribution, $p_s(h)$, which corresponds to the probability
that a path starts in vertex h; (ii) the ending
probability, $p_q(h)$, which corresponds to the probability that
a path ends in vertex h; and (iii) the transition probability,
$p_t(h_i | h_{i-1})$, which corresponds to the probability that
we walk from vertex $h_{i-1}$ to vertex $h_i$. With all these
probabilities defined, it is possible to compute the probability of
a path $h = [h_1, h_2, \ldots, h_l]$ in the graph G with Equation [1].

$$p(h|G) = p_s(h_1) \prod_{i=2}^{l} p_t(h_i | h_{i-1})\, p_q(h_l) \tag{1}$$

Figure 4: Graph Representation of Candidate #1 presented in Figure [2]. Each node is composed of the
word and its POS tag. The candidate entities are represented in black. We also represent the shortest path
between the two entities with dark edges. The nodes that cross the shortest path are represented in gray.

Figure 5: Example of a labeled graph that can be
used as input of the random walk kernel.



As we stated before, the objective of the kernel is to compute
the expected number of matching paths between two input
graphs. Let us define a kernel to compute the number of
matching subpaths between two paths of different graphs.
We assume that if the paths have different lengths, then
there is no match between them. If the paths have the same
length, the matching between them is given by the product
of the vertex and edge kernels. Assuming we have two paths
h and h', of lengths l and l', from two different graphs G and G',
then the kernel between z = (h, G) and z' = (h', G') is given by
Equation [2].

$$K_z(z, z') = \begin{cases} 0 & \text{if } l \neq l' \\ K_v(v_{h_1}, v'_{h'_1}) \prod_{i=2}^{l} K_v(v_{h_i}, v'_{h'_i}) \times K_e(e_{h_{i-1}h_i}, e'_{h'_{i-1}h'_i}) & \text{if } l = l' \end{cases} \tag{2}$$

Given $K_z(z, z')$ and $p(h|G)$, we can compute the expected
number of matching paths between the two graphs with
Equation [3].

$$K(G, G') = E[K_z(z, z')] = \sum_{h} \sum_{h'} K_z(z, z')\, p(h|G)\, p(h'|G') \tag{3}$$

Computing this kernel using a naive approach (i.e., going
through all the possible pairs of paths in the graphs) would
be computationally expensive for acyclic graphs and impossible
for graphs containing cycles. However, [17] demonstrated
that this kernel can be efficiently computed by solving a system
of linear equations. In order to define this system of
linear equations, let us first define the following matrices:

$$T = \begin{bmatrix} t(1,1',1,1') & t(1,1',1,2') & \cdots & t(1,1',|G|,|G'|') \\ t(1,2',1,1') & t(1,2',1,2') & \cdots & t(1,2',|G|,|G'|') \\ \vdots & \vdots & \ddots & \vdots \\ t(|G|,|G'|',1,1') & t(|G|,|G'|',1,2') & \cdots & t(|G|,|G'|',|G|,|G'|') \end{bmatrix}$$

$$S = \left[ s(1,1'),\ s(1,2'),\ \ldots,\ s(1,|G'|'),\ s(2,1'),\ \ldots,\ s(|G|,|G'|') \right]^{T}$$

$$Q = \left[ q(1,1'),\ q(1,2'),\ \ldots,\ q(1,|G'|'),\ q(2,1'),\ \ldots,\ q(|G|,|G'|') \right]^{T}$$

where

$$s(h_1, h'_1) = p_s(h_1)\, p'_s(h'_1)\, K_v(v_{h_1}, v'_{h'_1}) \tag{4}$$

$$q(h_l, h'_l) = p_q(h_l)\, p'_q(h'_l) \tag{5}$$

$$t(h_{i-1}, h'_{i-1}, h_i, h'_i) = p_t(h_i | h_{i-1})\, p'_t(h'_i | h'_{i-1}) \times K_v(v_{h_i}, v'_{h'_i})\, K_e(e_{h_{i-1}h_i}, e'_{h'_{i-1}h'_i}) \tag{6}$$

The system of linear equations that we need to solve is presented
in Equation [7].

$$(I - T) X = Q \tag{7}$$

where X is the solution of the system and I is the identity
matrix. [17] demonstrated that the random walk kernel between
graphs, K(G, G'), can be given by Equation [8].

$$K(G, G') = \langle S, X \rangle \tag{8}$$

where $\langle S, X \rangle$ is the inner product between two vectors.
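
The computation of this section can be sketched in plain Python as follows. This is an illustrative sketch, not the implementation used in the paper, and it rests on several assumptions: a graph is a pair `(labels, edges)` with `edges` a dictionary `{(i, j): label}`; the start distribution is uniform, a walk stops with a constant probability `p_stop` (1.0 at vertexes without outgoing edges) and otherwise moves uniformly over outgoing edges; and the vertex and edge kernels return values in [0, 1]. The linear system is solved by fixed-point iteration on X = Q + TX rather than by a direct solver; a routine such as `numpy.linalg.solve` could be substituted.

```python
def walk_distributions(n, edges, p_stop):
    """Uniform start, constant stop probability (1.0 at sinks), uniform moves."""
    successors = {i: [j for (a, j) in edges if a == i] for i in range(n)}
    p_s = [1.0 / n] * n
    p_q = [p_stop if successors[i] else 1.0 for i in range(n)]
    p_t = {(i, j): (1.0 - p_q[i]) / len(successors[i])
           for i in range(n) for j in successors[i]}
    return p_s, p_q, p_t


def random_walk_kernel(g1, g2, k_v, k_e, p_stop=0.5, tol=1e-12, max_iter=10000):
    """Return <S, X>, where X solves (I - T) X = Q for the pair of graphs."""
    (labels1, edges1), (labels2, edges2) = g1, g2
    n1, n2 = len(labels1), len(labels2)
    ps1, pq1, pt1 = walk_distributions(n1, edges1, p_stop)
    ps2, pq2, pt2 = walk_distributions(n2, edges2, p_stop)
    pairs = [(i, j) for i in range(n1) for j in range(n2)]
    S = {(i, j): ps1[i] * ps2[j] * k_v(labels1[i], labels2[j]) for i, j in pairs}
    Q = {(i, j): pq1[i] * pq2[j] for i, j in pairs}
    # Sparse T: row (a, c) lists ((b, d), t) for simultaneous steps a->b and c->d.
    T = {pair: [] for pair in pairs}
    for (a, b), e1 in edges1.items():
        for (c, d), e2 in edges2.items():
            t = pt1[(a, b)] * pt2[(c, d)] * k_v(labels1[b], labels2[d]) * k_e(e1, e2)
            if t > 0.0:
                T[(a, c)].append(((b, d), t))
    # Fixed-point iteration X <- Q + T X; it converges because p_stop > 0.
    X = dict(Q)
    for _ in range(max_iter):
        new_X = {p: Q[p] + sum(t * X[nxt] for nxt, t in T[p]) for p in pairs}
        change = max(abs(new_X[p] - X[p]) for p in pairs)
        X = new_X
        if change < tol:
            break
    return sum(S[p] * X[p] for p in pairs)


def match(a, b):
    """Delta kernel used for both vertex and edge labels in this toy example."""
    return 1.0 if a == b else 0.0


g = (["A", "B"], {(0, 1): "x"})
print(random_walk_kernel(g, g, match, match))  # 0.375
```

The printed value can be checked by hand against Equation [3]: with `p_stop = 0.5` the graph has three paths, [A] and [A, B] with probability 0.25 each and [B] with probability 0.5, so the expected number of matching paths of the graph with itself is 0.25² + 0.25² + 0.5² = 0.375.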



4.4 Parameters of the Random Walks Kernel 
for Relationship Extraction 

In Section [4.3], we described a kernel for generic labeled
graphs. The kernel we propose is a particularization of this
one applied to RE.

Recall that our representation of a sentence, presented in
Section [4.2], corresponds to a labeled graph where the labels
of the vertexes are vectors of tags (containing the word itself,
its lemma, POS tags, and orthographic patterns) and the
labels of the edges contain simply the type of the semantic
relationship between the two words. Moreover, each vertex
and edge contains information about whether it is in the
shortest path between the two entities. The vertexes also
contain information about whether they are entities.

In order to use the random walk kernel described in Section
[4.3], we had to define the kernels between the vertex labels
and the kernels between the edge labels. Given the fact that
the labels of the vertexes are simply vectors of attributes of
the word associated with the vertex, we can use the normalized
linear kernel presented in Equation [9].

$$K_v(v, v') = \frac{c(v, v')}{\sqrt{c(v, v)\, c(v', v')}} \tag{9}$$

where c(v, v') counts the number of common features between
the labels of v and v'.

In order to guarantee that entities can only match in a random
walk with other entities and that vertexes contained in
a shortest path can only match with vertexes contained in
a shortest path, we actually used a slightly modified version
of the kernel presented in Equation [9]. The modified version
is presented in Equation [10].

$$K_v(v, v') = \begin{cases} \dfrac{c(v, v')}{\sqrt{c(v, v)\, c(v', v')}} & \text{if } inSP(v) = inSP(v') \wedge isEntity(v) = isEntity(v') \\ 0 & \text{otherwise} \end{cases} \tag{10}$$



The kernel between the edges is very simple, since the label
of an edge is only a string indicating the type of semantic
relationship between the two words. We define this kernel
in Equation [11].

$$K_e(e, e') = \delta(e = e') \tag{11}$$

where $\delta$ is a function that returns 1 if its argument holds
and 0 otherwise.

Once again, since we want to differentiate edges in the shortest
path from edges outside the shortest path, we added a
simple modification to the kernel, which is presented in
Equation [12].

$$K_e(e, e') = \begin{cases} \delta(e = e') & \text{if } inSP(e) = inSP(e') \\ 0 & \text{otherwise} \end{cases} \tag{12}$$

Finally, we still need to define the probability distributions
necessary to compute the random walk kernel in our problem.
Since we have no prior knowledge
about the probability distributions, we follow the solution
proposed in [17] and consider that all the distributions are
uniform.
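
The vertex and edge kernels of this section can be sketched as below. The dictionary layout of vertexes and edges (`features`, `type`, `in_sp`, `is_entity`) is hypothetical, and the normalization of the common-feature count by the geometric mean of the self-counts is an assumption about what "normalized linear kernel" means here; when every vertex carries the same number of features it reduces to dividing by that number.

```python
from math import sqrt


def common_features(label, other):
    """c(v, v'): number of positions at which two feature tuples agree."""
    return sum(1 for a, b in zip(label, other) if a == b)


def vertex_kernel(v, w):
    """Normalized feature overlap, zero unless inSP and isEntity flags agree."""
    if v["in_sp"] != w["in_sp"] or v["is_entity"] != w["is_entity"]:
        return 0.0
    norm = sqrt(common_features(v["features"], v["features"])
                * common_features(w["features"], w["features"]))
    return common_features(v["features"], w["features"]) / norm if norm else 0.0


def edge_kernel(e, f):
    """Delta on the dependency type, zero unless the inSP flags agree."""
    if e["in_sp"] != f["in_sp"]:
        return 0.0
    return 1.0 if e["type"] == f["type"] else 0.0


# Hypothetical vertexes with (word, lemma, POS) features.
v = {"features": ("binds", "bind", "VBZ"), "in_sp": True, "is_entity": False}
w = {"features": ("bound", "bind", "VBD"), "in_sp": True, "is_entity": False}
print(vertex_kernel(v, w))
```

Here the two vertexes share only the lemma, so one of three features matches. Flipping `in_sp` or `is_entity` on either vertex forces the kernel to zero, which is what prevents entities or shortest-path words from matching ordinary words during a walk.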

4.5 Random Walks Kernel for Relationship Extraction

Using the random walk kernel presented in Section [4.3] and
the parameterization for the RE problem proposed in Section
[4.4], we produced three variations of the kernel: (i) Full
Graph Kernel; (ii) Shortest Path Kernel; and (iii) No Shortest
Path Kernel.

The Full Graph Kernel (FGK) corresponds to the application
of the random walk kernel to the whole structure described
in Section [4.2]. The idea of this kernel is to capture
the whole view of the graph structure (which is the same for 
all the candidates generated from a given sentence) but still 
be able to capture the similarity between interesting prop- 
erties that are specific to the candidates (i.e., shortest path 
and entities information). 

The Shortest Path Kernel (SPK) aims at exploiting the 
shortest path hypothesis presented in [7]. The idea is to 
apply the random walk kernel to the subgraph that corre- 
sponds to the shortest path between the entities. 

The No Shortest Path Kernel (NSPK) is a variation of FGK 
where the nodes and edges that belong to the shortest path 
are not marked as such. For this reason, the only thing that 
distinguishes the graph structures for candidates generated 
from a given sentence are the entities. 

The kernel we propose is based on a useful
property of kernels: a linear combination of several
kernels with non-negative weights is itself a kernel. We used
this approach because several works empirically demonstrated
that combining kernels in this way typically improves the
performance of individual kernels [9] [14].
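
The combination property can be sketched as a small higher-order function. The helper name `combine_kernels`, the default unit weights and the two toy kernels over strings are assumptions made for illustration; this section does not state which weights were used to combine FGK, SPK and NSPK.

```python
def combine_kernels(kernels, weights=None):
    """Return K(x, y) = sum_i w_i * K_i(x, y) for non-negative weights w_i."""
    if weights is None:
        weights = [1.0] * len(kernels)
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative to keep the result a kernel")

    def combined(x, y):
        return sum(w * k(x, y) for w, k in zip(weights, kernels))

    return combined


# Two toy kernels over strings, standing in for kernels such as FGK and SPK.
def length_kernel(x, y):
    return float(len(x) * len(y))


def identity_kernel(x, y):
    return 1.0 if x == y else 0.0


combined = combine_kernels([length_kernel, identity_kernel])
print(combined("ab", "ab"))  # 5.0
```

Because the result is again a function of two objects, it can be passed to any kernel-based learner in place of an individual kernel.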






5. EXPERIMENTS 

In this Section, we present the experiments performed in order
to evaluate our solution for RE and report on the results
obtained. First, in Section [5.1], we present the relationship extraction task.
Then, in Section [5.2], we describe the dataset. Section [5.3]
presents the metrics used to evaluate our kernel and Section
[5.4] presents the method used to support our claims
regarding the comparison of the kernels. In Section [5.5], we
point out some implementation details of our experiments.
In Section [5.6], we report on the performance of the individual
kernels presented in Section [4.5] and in Section [5.7] we report
on the combination of these kernels. In Section [5.8], we perform
a comparison between our solution and other methods.
Finally, in Section [5.9], we report on some experiments
combining our kernel with other methods.

5.1 Relationship Extraction Task 

In our evaluation, we focused exclusively on the extraction 
of relationships that correspond to protein-protein interac- 
tions. The idea is that, given pairs of entities there is a 
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Table 1: Number of training and testing candidates 
for each split 



where C represents the number of relationships correctly ex- 
tracted, / represents the number of relationships incorrectly 
extracted. 

The disadvantage of precision is that we can get high results 
extracting only information that we are sure to be right and 
ignoring information that are in the text and may be rele- 
vant. 



relationship between them if the text indicates that the pro- 
teins have some kind of biological interaction. 

5.2 Dataset 

We performed our experiments over a protein-protein inter- 
action dataset called AlmecO] This dataset has been used in 
previous works to evaluate the performance of relationship 
extraction systems in the task of extracting protein-protein 
interactions [8] |14| [2] . Aimed is composed by 225 Medline 
abstracts from which 200 describe interactions between pro- 
teins and the other 25 do not refer to any interaction. The 
total number of interacting pairs is 974 and the total number 
of non-interacting pairs is 4072. 

During the evaluation of our model we used a cross-validation 
strategy that is based on splits of the Aimed dataset at the 
level of document [8] [2] . Table [I] presents the number of 
positive and negative candidates that can be found in the 
training and testing data of each split. 

5.3 Evaluation Metrics 

Our experiments are focused on measuring the quality of the 
results produced when using our kernel. In Information Ex- 
traction (and particularly in Relationship Extraction), the 
quality of the results produced is based on two metrics: re- 
call and precision. 

Recall gives the ratio between the amount of information 
correctly extracted from the texts and the information 
available in the texts. Thus, recall measures the amount of relevant 
information extracted and is given by Equation 13: 

$$\mathrm{Recall} = \frac{C}{P} \qquad (13)$$

where C represents the number of correctly extracted 
relationships while P represents the total number of 
relationships that should be extracted. The disadvantage of this 
measure is that it returns high values when we 
extract all possible pairs of entities as relationships, regardless 
of whether they are related or not. 

Precision is the ratio between the amount of information 
correctly extracted from the texts and all the information 
extracted. Precision is thus a measure of confidence in 
the information extracted and is given by Equation 14: 

$$\mathrm{Precision} = \frac{C}{\text{number of extracted relationships}} \qquad (14)$$

The values of recall and precision may conflict. 
When we try to increase the recall, the value of precision 
may decrease and vice versa. The F-measure was adopted 
to measure the general performance of a system, balancing 
the values of recall and precision. It is given by Equation 15: 

$$F\text{-measure} = \frac{(\beta^2 + 1) \times P \times R}{\beta^2 \times P + R} \qquad (15)$$

where R represents the recall, P here represents the precision 
(not the total used in Equation 13), and β is a parameter 
that defines the relative weight of recall and precision. 
The value β can be interpreted as the number of times that 
recall is more important than precision. A value that is often 
used for β is 1, which gives the same weight to recall and 
precision. In this case, the F-measure value is obtained 
through Equation 16: 

$$F_1 = \frac{2 \times P \times R}{P + R} \qquad (16)$$

5.4 Significance Tests 

In order to support our claims during the comparison of each 
pair of kernels, we relied on significance tests. We used the 
paired t-test between each pair of kernels that we wanted to 
compare directly. Details about this significance test can be 
found in most statistics textbooks [5]. 

For a given metric presented in Section 5.3, we give as input 
to the test the result obtained for each split of the dataset. 
Our claims are based on a significance level of 5%. 
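
The paired t-test of Section 5.4 can be sketched as follows, using only the Python standard library. The per-split scores and the number of splits (ten) are made up for illustration. The critical value 2.262 is the standard two-sided 5% threshold for 9 degrees of freedom, so it would have to be replaced if the number of splits differs.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired per-split scores of two kernels."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long lists with at least two splits")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


def is_significant(t, critical_value):
    """Two-sided decision: reject equality when |t| exceeds the critical value."""
    return abs(t) > critical_value


# Hypothetical F1 values of two kernels on the same ten splits.
kernel_a = [0.50, 0.47, 0.52, 0.49, 0.51, 0.48, 0.53, 0.50, 0.49, 0.52]
kernel_b = [0.48, 0.47, 0.50, 0.50, 0.49, 0.47, 0.51, 0.49, 0.49, 0.50]
T_CRIT_DF9 = 2.262  # two-sided 5% critical value of Student's t, 9 degrees of freedom

t, df = paired_t_statistic(kernel_a, kernel_b)
print(t, df, is_significant(t, T_CRIT_DF9))
```

Because the test is paired, it looks at the per-split differences rather than at the two averages, so a small but consistent gain can be significant while a larger but erratic one is not.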

1 ftp://ftp.cs.utexas.edu/pub/mooney/bio-data/interactions.tar.gz 



5.5 Implementation Details 

Our experiments used the SVM package jLIBSVM², a Java 
port of LIBSVM that allows for easy customization when 
using different kernels. During the experiments, we used 
most of the default parameters of jLIBSVM. The only 
exception was the parameter C of the SVM (which controls the 
trade-off between the errors of the SVM and the size of the 
margin). For this parameter, after some empirical 
experimentation, we fixed its value at 50 for all the experiments. 

We used the OpenNLP³ module for sentence detection and 
the Stanford parser⁴ for word segmentation, POS tagging 
and generation of the labeled dependency graph. 

Finally, we used Parallel Colt⁵ to perform the matrix 
operations necessary for our kernel. 

2 http://dev.davidsoergel.com/trac/jlibsvm/ 

3 http://incubator.apache.org/opennlp/ 

5.6 Performance of the Individual Kernels 

Our first experiment aimed at understanding how each of 
the individual kernels that we proposed (i.e., FGK, SPK 
and NSPK, introduced in Section 4.5) performs. Table 2 
shows the results of this experiment. 

Kernel        Recall    Precision    F1 
FGK           41.51%    58.94%       48.25% 
SPK           43.47%    56.73%       48.86% 
NSPK          37.69%    58.47%       45.39% 

Table 2: Performance of the individual kernels on 
the Aimed data set. 

Kernel        Recall    Precision    F1 
FGK + SPK     45.21%    59.60%       51.83% 
FGK + NSPK    40.84%    57.56%       47.34% 
SPK + NSPK 
ALL           46.31%    59.01%       51.64% 

Table 3: Performance of the combinations of kernels on 
the Aimed data set. 



The results obtained are in line with what was expected. 
First, the individual kernel that obtains the highest value 
of F1 is SPK. Knowing that the shortest path hypothesis 
has been exploited with success in several other works, this 
comes as no surprise. Even though the average value of F1 
for SPK is higher than that for FGK, the difference is not 
statistically significant according to the significance tests. 

If we look only at the average values of recall and precision 
presented in Table 2, it seems that SPK is the best kernel 
in terms of recall and FGK is the best in terms of precision. 
However, when these two kernels are compared using the 
significance tests, the differences are not significant for 
either metric. 

Another result that is not surprising is that NSPK obtains 
the lowest recall and F1 of the three kernels. As discussed before, 
this kernel does not distinguish very well between candidates that are 
generated from the same sentence but are associated with 
different pairs of entities. This is reflected in a drastic drop in 
the recall value. 

5.7 Performance of the Combination of Kernels 

After analyzing the performance of the individual kernels, 
we evaluated the performance of the kernels that result from 
their combination. We considered the following four 
combinations: (i) FGK + SPK; (ii) FGK + NSPK; (iii) SPK + 
NSPK; and (iv) ALL = FGK + SPK + NSPK. 

Table 3 shows the results of this experiment. Given the 
performance of the individual kernels reported before, it was 
expected that the best combination of kernels would be 
either the one that combines all the individual kernels (ALL) 
or the one that combines the two best individual kernels 
(FGK + SPK). In fact, the results show that, regarding the 
average values of recall, precision and F1, the best 
combination is actually SPK + NSPK. 

The explanation for this surprising result has to do with the 
definition of these kernels. On the one hand, SPK was designed 
to distinguish between candidates generated 
from the same sentence and associated with different 
pairs of entities. On the other hand, NSPK is good at 
analyzing the whole structure of the dependency graph, but it does 
not distinguish very well between candidates generated from the 
same sentence. Thus, these two kernels are good at 
distinguishing very different contexts of the candidates. For this 
reason, they end up being a good complement to each other. 
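
A minimal sketch of how two or more kernels can be combined is given below, under the assumption that the "+" in names such as SPK + NSPK denotes the sum of the kernel values for each pair of candidates. The function name and the toy Gram matrices are hypothetical. The sum of valid (positive semi-definite) kernels is itself a valid kernel, so the combined matrix can be handed to the SVM in the same way as an individual one.

```python
def combine_kernels(*gram_matrices):
    """Element-wise sum of equally sized square Gram matrices (lists of lists)."""
    if not gram_matrices:
        raise ValueError("at least one Gram matrix is required")
    n = len(gram_matrices[0])
    for gram in gram_matrices:
        if len(gram) != n or any(len(row) != n for row in gram):
            raise ValueError("all Gram matrices must be square and of the same size")
    return [
        [sum(gram[i][j] for gram in gram_matrices) for j in range(n)]
        for i in range(n)
    ]


# Toy Gram matrices over the same two candidates (values made up for illustration).
spk = [[1.0, 0.0], [0.0, 1.0]]
nspk = [[2.0, 1.0], [1.0, 2.0]]
print(combine_kernels(spk, nspk))
```

In this toy case the first kernel sees the two candidates as unrelated while the second assigns them some similarity, so the sum carries information that neither matrix holds alone. When two kernels encode largely the same information, the sum adds little.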



Even though SPK + NSPK obtained the best average values 
of recall, precision and F1, it is important to note that, 
according to the significance tests, it is not fair to claim 
that it is superior to FGK + SPK 
and ALL, since the differences for all the metrics were not 
statistically significant. 

Another interesting observation has to do with the poor 
results obtained by FGK + NSPK. It is the kernel 
combination with the worst results in all the metrics. Moreover, the 
significance tests indicated that, in terms of recall and F1, 
the differences in comparison to the other 
combinations were significant. These results are also related to 
the type of information that the two individual kernels try to 
analyze. Recall that FGK is actually a modified and more 
refined version of NSPK in which vertices and edges of the 
shortest path between the candidate entities are treated 
differently. For this reason, most of the information exploited 
by both kernels is the same, which makes their combination 
somewhat redundant. 

Finally, we wanted to compare the combination kernels with 
the individual kernels to understand whether it pays off 
to use the combinations. For each metric, we compared 
the combination kernels with the individual kernel with the 
highest value of the metric as presented in Table 2. First, 
regarding recall, we observe that the differences 
between SPK and most of the combinations are not significant. 
The only exception is SPK + NSPK. Regarding 
precision, we compared with FGK and we observed that 
the gains from using the combinations in this case are not 
significant. Regarding F1, most of the 
combination kernels significantly outperform SPK. The 
only exception is FGK + NSPK. In fact, if we compare 
FGK + NSPK with both kernels that originate it, we 
notice that the differences in terms of F1 between them are not 
statistically significant. This is interesting because it 
illustrates how combining two kernels does not necessarily mean 
that the results will improve. 



4 http://nlp.stanford.edu/software/lex-parser.shtml 
5 http://sites.google.com/site/piotrwendykier/software/ 
Table 4: Performance of SPK + NSPK and of the kernels 
from [14] and [8] on the Aimed data set. 

Kernel 
SPK + NSPK + [14] 
SPK + NSPK + [8] 
[14] + [8] 
SPK + NSPK + [14] + [8] 

49.38% 
68.36% 
55.43% 
54.23% 
54.12% 
55.14% 

Table 5: Performance of the combinations of SPK + NSPK 
with the kernels from [14] and [8] on the Aimed data set. 



5.8 Comparison with Other Methods 

In order to compare the performance of our solution with 
other methods, we implemented two additional kernels 
described in the literature: (i) a kernel based on shallow 
linguistic information of the sentences [14]; and (ii) a kernel 
based on subsequences [8]. During these experiments we 
always compared these kernels with the combination of our 
kernels that showed the best average values 
of recall, precision and F1: SPK + NSPK. Table 4 
shows the results of this experiment. 

The most evident conclusion obtained by observing the 
results is that our solution is still outperformed by the shallow 
linguistic information kernel in terms of average values of the 
metrics. However, the significance tests for all the metrics 
indicate that the differences between SPK + NSPK and 
[14] are not significant. 

If we compare SPK + NSPK with [8], the results are very 
different. In fact, the results of the significance tests show 
that there are significant differences between these two 
kernels in terms of recall and precision (SPK + NSPK is 
better in terms of recall and [8] is better in terms of precision). 
However, in terms of F1, the differences are not significant 
(even though SPK + NSPK obtains a higher average 
value of F1). 

The differences in precision and recall of SPK + 
NSPK and [14] in comparison to [8] are worth 
mentioning: the precision values are not as high as in the 
subsequences kernel but the values of recall are significantly 
higher. This is interesting because it goes against a typical 
trend in works on supervised RE, in which the values of 
precision tend to be very high but the values of recall tend to 
be very low. 

5.9 Combination with Other Kernel Methods 

Finally, we performed some experiments to evaluate how 
combining SPK + NSPK with other methods influences 
the results. Once again, we used the two kernels that we 
compared our solution to in Section 5.8. Table 5 presents 
the results of this experiment. 

By analyzing the results obtained in this experiment, we 
observe that the best combination is the one that joins SPK + 
NSPK with [14]. Moreover, even the combination of SPK + 
NSPK with [8] is able to outperform the combination of [14] 
and [8]. 

In order to understand these results, recall that [14] is based 
on several kernels including information of n-grams in three 
different locations of the sentence: before the first entity, 
between the entities and after the second entity. Knowing that 
n-grams are among the subsequences of the sentence, it is 
easy to understand that there is some overlapping information 
when combining these two kernels. 

When these kernels are combined with SPK + NSPK, we 
are joining information from completely different sources: 
sequences and dependency graphs. For this reason, the kernel 
we propose is particularly interesting when used in combination 
with kernels from different sources. 


We also wanted to determine whether the difference of the 
results of these combinations in comparison to the 
individual kernels was significant. Thus, we performed 
significance tests between SPK + NSPK, [14], [8] and all their 
combinations presented in Table 5. 

Regarding recall, the differences between the 
combinations, SPK + NSPK and [14] are not significant. 
However, the tests indicate that all the combinations are able to 
outperform [8]. This comes as no surprise knowing that 
the differences in terms of average value of recall were very 
high. 

Regarding precision, the significance tests show that 
combining SPK + NSPK with all the other kernels has a 
significant impact. The tests also obtain the same result for 
[14]. With [8] the results are different: none of the 
combinations is able to significantly outperform [8]. 

When comparing the results of the significance tests for F1, 
there is only one combination that is able to clearly 
outperform SPK + NSPK and [14]. This combination is actually 
the one that combines both these kernels. In all the other 
cases, the differences are not significant. Regarding [8], all 
the combinations are able to significantly outperform it in 
terms of F1. 

6. CONCLUSIONS AND FUTURE WORK 

This paper proposes a solution for Relationship Extraction 
(RE) based on labeled graph kernels. The proposed kernel 
is a particularization of the Random Walk Kernel for generic 
labeled graphs presented in JTt] . In order to make the kernel 
suitable for RE tasks, we exploited two properties typically 
used in this line of work: (i) the words between the candi- 
date entities or connecting them in a syntactic representa- 
tion are particularly likely to carry information regarding the 
relationship; and (ii) combining information from distinct 
sources in a kernel may help the RE system to make bet- 
ter decisions. Our experiments show that the performance 
of our solution is comparable with the state-of-the-art on 
RE. Moreover, we showed that combining our solution with 
other methods for RE leads to significant gains in terms of 
performance. 

Interesting topics for future work include the study of 
different parameterizations of the Random Walk Kernel for RE. 
Namely, we want to try different kernels for vertex and edge 
labels as well as different probability distributions associated 
with the vertices and the transitions. Moreover, it would be 
interesting to compare this kernel directly with other methods 
and test the combination of other kernels with ours. Finally, 
we would also like to test our solution with other datasets, 
namely the ACE dataset, which is composed of documents 
containing a wide variety of relationships (e.g., CEO_Of, 
Located-In) involving several types of entities (e.g., person, 
organization, location). 